Logistic regression

In statistics, logistic regression (sometimes called the logistic model or logit model) is used to predict the probability that an event occurs by fitting data to a logistic function. It is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer's propensity to purchase a product or cancel a subscription.

1 Definition
2 Sample size-dependent efficiency
3 Example
4 Formal mathematical specification
5 Extensions
6 Model accuracy
7 See also
8 References
9 External links

Definition

An explanation of logistic regression begins with an explanation of the logistic function, which, like probabilities, always takes on values between zero and one:

$f(z) = \frac{e^{z}}{e^{z} + 1} = \frac{1}{1 + e^{-z}}$

A graph of the function is shown in figure 1. The input is z and the output is ƒ(z). The logistic function is useful because it can take as an input any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1. The variable z represents the exposure to some set of independent variables, while ƒ(z) represents the probability of a particular outcome, given that set of explanatory variables. The variable z is a measure of the total contribution of all the independent variables used in the model and is known as the logit.
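The bounded behaviour described above can be checked numerically. The following is a minimal Python sketch, not a reference implementation; the function name `logistic` and the choice to split on the sign of z (to avoid overflow for large negative inputs) are assumptions made for illustration.

```python
import math

def logistic(z: float) -> float:
    """Logistic function f(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the algebraically equal form e^z / (e^z + 1),
    # which avoids computing e^(-z) for a large positive exponent.
    ez = math.exp(z)
    return ez / (ez + 1.0)

# The output stays between 0 and 1 and increases with z.
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"z = {z:+.1f}  f(z) = {logistic(z):.4f}")
```

At z = 0 the function returns exactly 0.5, and f(z) + f(-z) = 1 for any z, which reflects the symmetry of the curve about that point.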

The variable z is usually defined as

$z=\beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \cdots + \beta_kx_k,$

where $\beta_0$ is called the "intercept" and $\beta_1$, $\beta_2$, $\beta_3$, and so on, are called the "regression coefficients" of $x_1$, $x_2$, $x_3$, and so on, respectively. The intercept is the value of z when the values of all independent variables are zero (e.g. the value of z in someone with no risk factors). Each regression coefficient describes the size of the contribution of the corresponding risk factor. A positive regression coefficient means that the explanatory variable increases the probability of the outcome, while a negative regression coefficient means that the variable decreases the probability of that outcome; a regression coefficient that is large in magnitude means that the risk factor strongly influences the probability of that outcome, while a near-zero regression coefficient means that the risk factor has little influence on the probability of that outcome.
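The role of the intercept and of the sign of each coefficient can be illustrated with a short Python sketch. The coefficient values and the helper names `linear_predictor` and `probability` are hypothetical, chosen only to show one positive, one negative and one zero coefficient.

```python
import math

def linear_predictor(intercept, coefficients, x):
    """z = beta_0 + beta_1*x_1 + ... + beta_k*x_k."""
    if len(coefficients) != len(x):
        raise ValueError("need one coefficient per independent variable")
    return intercept + sum(b * xi for b, xi in zip(coefficients, x))

def probability(intercept, coefficients, x):
    """Pass the linear predictor z through the logistic function."""
    z = linear_predictor(intercept, coefficients, x)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, for illustration only.
beta_0 = -1.0
betas = [0.8, -0.5, 0.0]  # positive, negative and zero coefficients

baseline = probability(beta_0, betas, [0, 0, 0])    # all variables zero: z = beta_0
raised = probability(beta_0, betas, [1, 0, 0])      # positive coefficient
lowered = probability(beta_0, betas, [0, 1, 0])     # negative coefficient
unchanged = probability(beta_0, betas, [0, 0, 1])   # zero coefficient

print(baseline, raised, lowered, unchanged)
```

With these values the second probability is above the baseline, the third is below it, and the fourth equals it, matching the verbal description of positive, negative and near-zero coefficients.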

Logistic regression is a useful way of describing the relationship between one or more independent variables (e.g., age or sex) and a binary response variable that has only two values, such as cancer status ("has cancer" or "doesn't have cancer"), with the response expressed as a probability.

Sample size-dependent efficiency

Logistic regression tends to systematically overestimate odds ratios or beta coefficients when the sample size is less than about 500. With increasing sample size, the magnitude of overestimation diminishes and the estimated odds ratio asymptotically approaches the true population value. In a single study, overestimation due to small sample size might not have any relevance for the interpretation of the results, since it is much lower than the standard error of the estimate. However, if a number of small studies with systematically overestimated effects are pooled together without consideration of this effect, an effect may be perceived when in reality it does not exist.^[1]

A minimum of 10 events per independent variable has been recommended.^[2]^[3] For example, in a study where death is the outcome of interest, and 50 of 100 patients die, the maximum number of independent variables the model can support is 50/10 = 5.
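The arithmetic of this rule of thumb is simple enough to write down directly. The sketch below is only an illustration of the "10 events per variable" guideline as stated above; the function name `max_predictors` and its parameters are hypothetical.

```python
def max_predictors(n_events: int, events_per_variable: int = 10) -> int:
    """Largest number of independent variables supported under an
    events-per-variable rule of thumb (integer division rounds down)."""
    if events_per_variable <= 0:
        raise ValueError("events_per_variable must be positive")
    return n_events // events_per_variable

# 50 deaths among 100 patients, as in the example above.
print(max_predictors(50))  # 5
```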

Example

The application of a logistic regression may be illustrated using a fictitious example of death from heart disease. This simplified model uses only three risk factors (age, sex, and blood cholesterol level) to predict the 10-year risk of death from heart disease. Suppose that fitting the model to the data yields the following parameters:

$\beta_0=-5.0 \text{ (the intercept)}$

$\beta_1=+2.0$

$\beta_2=-1.0$

$\beta_3=+1.2$

$x_1=\text{ age in years, above 50}$

$x_2=\text{ sex, where 0 is male and 1 is female}$

$x_3=\text{ cholesterol level, in mmol/L above 5.0}$

The model can hence be expressed as

$\text{risk of death} = \frac{1}{1+e^{-z}} \text{, where } z=-5.0 + 2.0x_1 - 1.0x_2 + 1.2x_3.$

In this model, increasing age is associated with an increasing risk of death from heart disease (z goes up by 2.0 for every year over the age of 50), female sex is associated with a decreased risk of death from heart disease (z goes down by 1.0 if the patient is female), and increasing cholesterol is associated with an increasing risk of death (z goes up by 1.2 for each 1 mmol/L increase in cholesterol above 5 mmol/L).

We wish to use this model to predict a particular subject's risk of death from heart disease: he is 50 years old and his cholesterol level is 7.0 mmol/L. The subject's risk of death is therefore

$\frac{1}{1+e^{-z}} \text{, where } z=-5.0 + (+2.0)(50-50) + (-1.0)(0) + (+1.2)(7.0-5.0) = -2.6.$

This means that by this model, the subject's risk of dying from heart disease in the next 10 years is 0.07 (or 7%).
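The calculation can be reproduced with a few lines of Python. This is a sketch of the fictitious model only: the function name `risk_of_death` and its argument names are hypothetical, and it assumes that x_1 is simply the age minus 50 and x_3 the cholesterol level minus 5.0.

```python
import math

# Coefficients of the fictitious heart-disease model.
BETA_0 = -5.0     # intercept
BETA_AGE = 2.0    # per year of age above 50
BETA_SEX = -1.0   # 0 = male, 1 = female
BETA_CHOL = 1.2   # per mmol/L of cholesterol above 5.0

def risk_of_death(age_years, is_female, cholesterol_mmol_per_l):
    """10-year risk of death from heart disease under the example model."""
    x1 = age_years - 50
    x2 = 1 if is_female else 0
    x3 = cholesterol_mmol_per_l - 5.0
    z = BETA_0 + BETA_AGE * x1 + BETA_SEX * x2 + BETA_CHOL * x3
    return 1.0 / (1.0 + math.exp(-z))

# The subject from the text: male, 50 years old, cholesterol 7.0 mmol/L.
risk = risk_of_death(50, False, 7.0)
print(round(risk, 2))  # 0.07
```

Here z = -2.6, so the risk is 1 / (1 + e^2.6), or about 0.069, which rounds to the 7% quoted above.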

Formal mathematical specification

Logistic regression analyzes binomially distributed data of the form

$Y_i \ \sim B(n_i,p_i),\text{ for }i = 1, \dots , m,$

where the numbers of Bernoulli trials n_i are known and the probabilities of success p_i are unknown. An example is the number of seeds Y_i that germinate after n_i are planted, where p_i is the probability that a seed germinates.

The model proposes that for each observation i there is a set of explanatory variables that might inform the final probability. These explanatory variables can be thought of as being in a k-dimensional vector X_i and the model then takes the form

$p_i = \operatorname{E}\left(\left.\frac{Y_i}{n_{i}}\right|X_i \right). \,$

The logits, natural logs of the odds, of the unknown binomial probabilities are modeled as a linear function of the X_i.

$\operatorname{logit}(p_i)=\ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}.$

Note that a particular element of X_i can be set to 1 for all i to yield an intercept in the model. The unknown parameters β_j are usually estimated by maximum likelihood using a method common to all generalized linear models. The maximum likelihood estimates can be computed numerically by using iteratively reweighted least squares.
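To make the iteratively reweighted least squares idea concrete, the following Python sketch fits a model with an intercept and a single explanatory variable to 0/1 outcomes. It is a simplified illustration under stated assumptions, not a general implementation: the function name `fit_logistic_irls`, the data, and the restriction to one predictor (so that the weighted least squares step can be solved in closed form) are all choices made here for brevity.

```python
import math

def fit_logistic_irls(x, y, max_iter=25, tol=1e-10):
    """Fit logit(p) = b0 + b1*x to 0/1 outcomes by iteratively
    reweighted least squares (one predictor plus an intercept)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        sw = swx = swxx = swz = swxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi                  # current linear predictor
            p = 1.0 / (1.0 + math.exp(-eta))    # current fitted probability
            w = p * (1.0 - p)                   # IRLS weight
            z = eta + (yi - p) / w              # working response
            sw += w
            swx += w * xi
            swxx += w * xi * xi
            swz += w * z
            swxz += w * xi * z
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations.
        det = sw * swxx - swx * swx
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# Made-up data, for illustration only; the outcomes overlap, so the
# maximum likelihood estimate exists.
x_data = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y_data = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic_irls(x_data, y_data)
print(b0, b1)
```

At convergence the fitted probabilities satisfy the likelihood equations: the residuals y_i − p_i sum to zero, both unweighted and weighted by x_i. The sketch does not guard against perfectly separated data, for which the weights tend to zero and the estimates diverge.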

Each parameter estimate β_j is interpreted as the additive effect on the log of the odds of a unit change in the jth explanatory variable. In the case of a dichotomous explanatory variable, for instance gender, $e^\beta$ is the estimate of the odds ratio of having the outcome for, say, males compared with females.
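The reading of $e^\beta$ as a ratio of odds can be verified numerically. The sketch below reuses the intercept and the sex coefficient of the fictitious heart-disease example (with the other variables held at zero); the helper names `logistic` and `odds` are assumptions made for illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

beta_0 = -5.0    # intercept from the example
beta_sex = -1.0  # coefficient for sex (0 = male, 1 = female)

p_male = logistic(beta_0)                # x_2 = 0
p_female = logistic(beta_0 + beta_sex)   # x_2 = 1

odds_ratio = odds(p_female) / odds(p_male)
print(odds_ratio, math.exp(beta_sex))  # the two values agree
```

Both quantities equal e^(-1), about 0.37: in that example the odds of death for females are roughly 37% of the odds for males, regardless of the values at which the other variables are held.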

The model has an equivalent formulation

$p_i = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}.$
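The equivalence with the logit form can be checked by solving for $p_i$. Writing $z_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}$ (a shorthand introduced here only for brevity), the steps are:

$\ln\left(\frac{p_i}{1-p_i}\right) = z_i \quad\Longrightarrow\quad \frac{p_i}{1-p_i} = e^{z_i}$

$p_i = e^{z_i}(1-p_i) \quad\Longrightarrow\quad p_i\left(1+e^{z_i}\right) = e^{z_i}$

$p_i = \frac{e^{z_i}}{1+e^{z_i}} = \frac{1}{1+e^{-z_i}}.$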

This functional form is commonly called a single-layer perceptron or single-layer artificial neural network. A single-layer neural network computes a continuous output instead of a step function. The derivative of p_i with respect to X = x₁...x_k is computed from the general form:

$y = \frac{1}{1+e^{-f(X)}}$

where f(X) is an analytic function in X. With this choice, the single-layer neural network is identical to the logistic regression model. This function has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:

$\frac{\mathrm{d}y}{\mathrm{d}X} = y(1-y)\frac{\mathrm{d}f}{\mathrm{d}X}. \,$
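The derivative formula can be checked against a numerical approximation. The following Python sketch assumes a scalar input and a hypothetical linear choice f(x) = -1.0 + 0.5x; it compares the closed form y(1 − y)·df/dx with a central finite difference.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical f, chosen only for illustration.
def f(x):
    return -1.0 + 0.5 * x

def df_dx(x):
    return 0.5

def dy_dx(x):
    """Closed form: dy/dx = y * (1 - y) * df/dx."""
    y = logistic(f(x))
    return y * (1.0 - y) * df_dx(x)

def dy_dx_numeric(x, h=1e-6):
    """Central finite-difference approximation of dy/dx."""
    return (logistic(f(x + h)) - logistic(f(x - h))) / (2.0 * h)

for x in (-3.0, 0.0, 2.0, 5.0):
    print(x, dy_dx(x), dy_dx_numeric(x))
```

The two columns agree to many decimal places, and the derivative is largest where y is near 0.5, since y(1 − y) peaks there.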

Extensions

Extensions of the model cope with dependent variables with more than two values, also called polytomous regression. Ordered logistic regression handles ordinal dependent variables (ordered values). Multinomial logistic regression handles nominal dependent variables (unordered values, also called "classification"). An extension of the logistic model to sets of interdependent variables is the conditional random field.

Model accuracy

A way to test for errors in models created by step-wise regression is to not rely on the model's F-statistic, significance, or multiple-r, but instead assess the model against a set of data that was not used to create the model.^[4] The class of techniques is called cross-validation.
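One simple way to set aside data that is not used to create the model is a random holdout split. The sketch below is an illustration only; the function name `holdout_split`, the 30% holdout fraction and the fixed seed are assumptions, and the model would be fitted to the first returned list and assessed on the second.

```python
import random

def holdout_split(records, holdout_fraction=0.3, seed=0):
    """Randomly partition records into (training, holdout) lists."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)       # copy, so the input is left untouched
    rng.shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

records = list(range(10))          # stand-in for ten data records
training, holdout = holdout_split(records)
print(len(training), len(holdout))  # 7 3
```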

Accuracy is measured as the proportion of correctly classified records in the holdout sample.^[5] There are four possible classifications:

prediction of 0 when the holdout sample has a 0 (True Negative/TN)
prediction of 0 when the holdout sample has a 1 (False Negative/FN)
prediction of 1 when the holdout sample has a 0 (False Positive/FP)
prediction of 1 when the holdout sample has a 1 (True Positive/TP)

These classifications are used to measure Precision and Recall:

$\text{Precision}=\frac{TP}{TP+FP}$

$\text{Recall}=\frac{TP}{TP+FN}$
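The four classifications and the two ratios can be computed from paired lists of actual and predicted labels. The Python sketch below uses made-up labels and hypothetical helper names (`confusion_counts`, `precision`, `recall`).

```python
def confusion_counts(actual, predicted):
    """Return (TN, FN, FP, TP) for paired lists of 0/1 labels."""
    tn = fn = fp = tp = 0
    for a, p in zip(actual, predicted):
        if p == 0 and a == 0:
            tn += 1
        elif p == 0 and a == 1:
            fn += 1
        elif p == 1 and a == 0:
            fp += 1
        else:
            tp += 1
    return tn, fn, fp, tp

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else float("nan")

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else float("nan")

# Made-up holdout labels, for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fn, fp, tp = confusion_counts(actual, predicted)
print(tn, fn, fp, tp)                     # 4 1 1 4
print(precision(tp, fp), recall(tp, fn))  # 0.8 0.8
```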

The percentage of correctly classified observations in the holdout sample is referred to as the assessed model accuracy. Accuracy can additionally be expressed as the model's ability to correctly classify 0, or its ability to correctly classify 1, in the holdout dataset. The holdout model assessment method is particularly valuable when data are collected in different settings (e.g., at different times or places) or when models are assumed to be generalizable.
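Overall and per-class accuracy follow directly from the four counts. The sketch below uses hypothetical counts for a holdout sample of 100 records; the function names are assumptions made for illustration.

```python
def overall_accuracy(tn, fn, fp, tp):
    """Share of all holdout records that were classified correctly."""
    return (tn + tp) / (tn + fn + fp + tp)

def accuracy_on_zeros(tn, fp):
    """Share of actual 0s that were predicted as 0."""
    return tn / (tn + fp)

def accuracy_on_ones(tp, fn):
    """Share of actual 1s that were predicted as 1 (same ratio as recall)."""
    return tp / (tp + fn)

# Hypothetical counts, for illustration only.
tn, fn, fp, tp = 50, 5, 10, 35

print(overall_accuracy(tn, fn, fp, tp))  # 0.85
print(accuracy_on_zeros(tn, fp))         # 50 / 60
print(accuracy_on_ones(tp, fn))          # 0.875
```

Reporting the two per-class figures alongside the overall figure is useful when one outcome is much rarer than the other, since a high overall accuracy can then hide poor performance on the rare class.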

References

[1] Nemes S, Jonasson JM, Genell A, Steineck G (2009). "Bias in odds ratios by logistic regression modelling and sample size". BMC Medical Research Methodology 9: 56.
[2] Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR (1996). "A simulation study of the number of events per variable in logistic regression analysis". J Clin Epidemiol 49 (12): 1373–9. PMID 8970487.
[3] Agresti A (2007). "Building and applying logistic regression models". An Introduction to Categorical Data Analysis. Hoboken, New Jersey: Wiley. p. 138. ISBN 978-0-471-22618-5.
[4] Mark J, Goldberg MA (2001). "Multiple Regression Analysis and Mass Assessment: A Review of the Issues". The Appraisal Journal, Jan.: 89–109.
[5] Mayers JH, Forgy EW (1963). "The Development of numerical credit evaluation systems". Journal of the American Statistical Association 58 (303, Sept): 799–806.

Agresti, Alan (2002). Categorical Data Analysis. New York: Wiley-Interscience. ISBN 0-471-36093-7.
Amemiya, T. (1985). Advanced Econometrics. Harvard University Press. ISBN 0-674-00560-0.
Balakrishnan, N. (1991). Handbook of the Logistic Distribution. Marcel Dekker, Inc. ISBN 978-0824785871.
Greene, William H. (2003). Econometric Analysis, fifth edition. Prentice Hall. ISBN 0-13-066189-9.
Hilbe, Joseph M. (2009). Logistic Regression Models. Chapman & Hall/CRC Press. ISBN 978-1-4200-7575-5.
Hosmer, David W.; Stanley Lemeshow (2000). Applied Logistic Regression, 2nd ed. New York; Chichester, Wiley. ISBN 0-471-35632-8.

External links
